[SPARK-55793][CORE] Add multiple log directories support to SHS #54575
sarutak wants to merge 9 commits into apache:master
dongjoon-hyun left a comment
Thank you, @sarutak .
I understand the prod and staging use cases.
What happens when there is a conflict among the log directories? For example, what if a user wants to abuse this as a kind of multi-tier log management like the following, and copies logs from shorterm to longterm? Of course, the sync operation is non-atomic.
- hdfs://spark-events/shorterm
- hdfs://spark-events/longterm
What are the semantics of the ordering in the config value, especially when we have SPARK-52914?
Could you fix the CI failures?
@dongjoon-hyun Thank you for your interest.
Each event log file is tracked by its full path as the key in
The ordering of directories in the config value has no semantic meaning. All directories are scanned equally in each polling cycle ( On-demand loading operates per log file within
Thank you. This is a nice feature. I'll try to test more seriously.
dongjoon-hyun left a comment
We need a clearer definition of the relationship between this and the existing spark.history.fs.* configuration. At first glance,
- Do you want to have per-directory configurations in the future?
- For now,
  - spark.history.fs.update.interval is supposed to be applied for one scan for all directories?
  - spark.history.fs.cleaner.interval is also supposed to be applied for one scan for all directories?
- When spark.history.fs.cleaner.maxNum is applied,
  - This PR will consider the total number of files for all directories, right?
  - Which directory will be selected as a victim for the tie?
Since this introduces some ambiguity, could you revise the PR title and provide a corresponding documentation update (docs) together in this PR?
@dongjoon-hyun Thank you for your feedback.
I considered it might be helpful to have per-directory configurations (e.g.
Yes.
Yes.
Yes, the property is applied to the total number of log entries across all directories. As the updated document says, when the limit is exceeded, the oldest completed attempts are deleted first regardless of which directory they belong to.
Updated (You said |
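The global cleaner.maxNum behavior described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code in FsHistoryProvider: the names `Attempt`, `victims`, `logPath`, `lastUpdated` and `completed` are hypothetical, and the tie-breaking shown (input order, because the sort is stable) is an assumption, since the reply does not say how ties are resolved.

```scala
object GlobalCleanerSketch {
  // Hypothetical summary of one application attempt, whichever directory it lives in.
  final case class Attempt(logPath: String, lastUpdated: Long, completed: Boolean)

  // Returns the attempts to delete so that at most maxNum entries remain overall.
  // The limit is applied to the combined listing, not per directory.
  def victims(all: Seq[Attempt], maxNum: Int): Seq[Attempt] = {
    val excess = all.size - maxNum
    if (excess <= 0) Seq.empty
    else {
      // Only completed attempts are candidates; the oldest ones go first.
      // sortBy is stable, so attempts with equal timestamps keep their input order.
      all.filter(_.completed).sortBy(_.lastUpdated).take(excess)
    }
  }
}
```

With three attempts spread over two directories and a limit of two, the single victim is the oldest completed attempt, regardless of which directory holds it.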
dongjoon-hyun left a comment
Thank you for updating.
dongjoon-hyun left a comment
+1, I support @sarutak's proposal and this PR's approach. Thank you.
cc @mridulm , @yaooqinn , @LuciferYang , too.
Thanks @yaooqinn for your feedback. I've updated.
@sarutak I found one potential correctness issue: In This may be ambiguous when two configured directories contain the same event log basename (e.g. after migration/copy or multi-cluster aggregation, although the probability is very low). In that case, download/rebuild/cleanup may operate on a different physical file than the one originally indexed. For example, the following test case: In a scenario where multiple log directories are configured in SHS and there are event log files with the same name in different directories, I expect the log path of app2 to be resolved to the second directory where it actually resides. However, in practice, So can we assume that filenames are globally unique across configured directories?
@LuciferYang Thank you for pointing it out. While event log filenames are typically unique in normal operation, duplicates can occur during migration or multi-cluster log aggregation.
I also added a test based on what you showed.
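The ambiguity discussed in this exchange can be illustrated with a small sketch. It only shows why a basename is not a safe key once several directories are configured. It is not the fix applied in this PR, and `LogInfo`, `basename`, `indexByBasename` and `indexByFullPath` are hypothetical names.

```scala
object LogKeySketch {
  // Hypothetical record for one indexed event log.
  final case class LogInfo(fullPath: String, appId: String)

  // Last path component, e.g. "app-1" for "hdfs:///logs/prod/app-1".
  def basename(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  // Keyed by basename: a later entry silently replaces an earlier one with the same name.
  def indexByBasename(logs: Seq[LogInfo]): Map[String, LogInfo] =
    logs.map(l => basename(l.fullPath) -> l).toMap

  // Keyed by full path: same-named files in different directories stay distinct.
  def indexByFullPath(logs: Seq[LogInfo]): Map[String, LogInfo] =
    logs.map(l => l.fullPath -> l).toMap
}
```

With two directories that each contain a file named `app-1`, the basename index keeps one entry and resolves it to whichever directory was indexed last, while the full-path index keeps both.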
f1b6e0d to 68272d7
    try {
      checkForLogsInDir(dir, newLastScanTime, allNotStale)
    } catch {
      case e: Exception =>
Can we catch a narrower, more specific exception instead of Exception?
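One way to read this suggestion is sketched below: scan each directory independently and catch only I/O failures, so that one unavailable directory does not stop the others from being scanned. This is a hedged illustration only. `scanAll` and `listLogs` are hypothetical names, and using `IOException` as the narrower type is an assumption, not what the PR ended up catching.

```scala
import java.io.IOException

object ScanSketch {
  // Scans every directory with the supplied listing function.
  // Returns the logs found and the directories whose scan failed with an I/O error.
  def scanAll(dirs: Seq[String])(listLogs: String => Seq[String]): (Seq[String], Seq[String]) = {
    val results: Seq[Either[String, Seq[String]]] = dirs.map { dir =>
      try Right(listLogs(dir))
      catch {
        // Only I/O failures are tolerated here; any other exception still propagates.
        case _: IOException => Left(dir)
      }
    }
    val logs = results.collect { case Right(found) => found }.flatten
    val failed = results.collect { case Left(dir) => dir }
    (logs, failed)
  }
}
```

Catching a narrow type keeps programming errors visible, since anything other than an I/O failure still surfaces instead of being swallowed as a scan failure.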
Can we have some test coverage for the following cases?
- What happens when one of the directories is removed after SHS has been running for a while?
- What happens when one of the directories does not exist before SHS starts and is only created after SHS is running?
For example, I'm thinking about MONTHLY LOG DIRECTORY scenarios.
s3://spark-events/2026/04/
s3://spark-events/2026/05/
It's not a blocker. We can consider the above as a new JIRA issue, @sarutak .
@dongjoon-hyun Thank you for your suggestion. I'll open a PR for some more test coverage.
Thank you! Feel free to merge this PR first so that we can get more feedback from the community, @sarutak.
Merged to |
…ories feature

### What changes were proposed in this pull request?

This PR proposes to add more tests for SHS multiple log directories feature added in SPARK-55793 (#54575). New tests include:

- **directory removed while SHS is running**: verifies that removing a log directory at runtime does not crash the scan and apps from remaining directories are still listed
- **directory does not exist at startup but created later**: verifies that a directory that doesn't exist at startup is picked up on subsequent scans (monthly directory scenario)
- **directory temporarily inaccessible then recovers**: verifies that apps reappear after a temporarily inaccessible directory is restored
- **all directories inaccessible does not crash**: verifies graceful handling when all configured directories become unavailable
- **config with empty entries between commas**: verifies that empty entries in `spark.history.fs.logDirectory` (e.g., `dir1,,dir2`) are handled correctly
- **logDirectory.names count mismatch falls back to full paths**: verifies that when the number of names doesn't match the number of directories, display names fall back to full paths

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that all new tests passed.

```
$ build/sbt 'testOnly org.apache.spark.deploy.history.RocksDBBackendFsHistoryProviderSuite'
```

### Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6

Closes #54660 from sarutak/shs-multi-log-dirs-more-tests.

Authored-by: Kousuke Saruta <sarutak@amazon.co.jp>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR proposes to add multiple log directories support to SHS, allowing it to monitor event logs from multiple directories simultaneously.
This PR extends spark.history.fs.logDirectory to accept a comma-separated list of directories (e.g., hdfs:///logs/prod,s3a://bucket/logs/staging). Directories can be on the same or different filesystems. Also, a new optional config spark.history.fs.logDirectory.names is added which allows users to assign display names to directories by position (e.g., Production,Staging). Empty entries fall back to the full path. Duplicate display names are rejected at startup.
Behavior of existing spark.history.fs.* settings with multiple directories:
All existing settings apply globally; there are no per-directory configurations. The settings listed are:
- update.interval
- cleaner.interval
- cleaner.maxAge
- cleaner.maxNum
- numReplayThreads
- numCompactThreads
- eventLog.rolling.maxFilesToRetain
- update.batchSize
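The directory and display-name rules described above (names assigned by position, empty names falling back to the full path, duplicate names rejected) can be sketched as follows. This is a minimal illustration under stated assumptions, not the parsing code in this PR: `LogSource` and `resolve` are hypothetical names, and dropping empty directory entries before matching names by position is an assumption.

```scala
object LogDirConfigSketch {
  // Hypothetical pair of a log directory and the name shown for it in the UI.
  final case class LogSource(path: String, displayName: String)

  // Splits a comma-separated value and trims each entry.
  // The -1 limit keeps trailing empty entries so positions stay aligned.
  private def split(value: String): Seq[String] =
    value.split(",", -1).map(_.trim).toSeq

  def resolve(logDirectory: String, names: Option[String]): Seq[LogSource] = {
    // Assumption: empty directory entries are ignored.
    val dirs = split(logDirectory).filter(_.nonEmpty)
    val provided = names.map(split).getOrElse(Seq.empty)
    val sources = dirs.zipWithIndex.map { case (dir, i) =>
      // A missing or empty name at this position falls back to the full path.
      val name = provided.lift(i).filter(_.nonEmpty).getOrElse(dir)
      LogSource(dir, name)
    }
    val duplicates = sources.groupBy(_.displayName).collect {
      case (name, group) if group.size > 1 => name
    }
    val duplicateList = duplicates.mkString(", ")
    // Mirrors "rejected at startup": fail fast instead of showing ambiguous names.
    require(duplicates.isEmpty, s"Duplicate display names: $duplicateList")
    sources
  }
}
```

For example, `resolve("hdfs:///logs/prod,s3a://bucket/logs/staging", Some("Production,"))` names the first directory Production and leaves the second one identified by its full path.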
Regarding UI changes, a "Log Source" column is added to the History UI table showing the display name (or full path) for each application, with a tooltip showing the full path.
Users can filter applications by their log directory using the Filter by Log Source dropdown.
The Event log directory section in the History UI collapses into a <details>/<summary> element when multiple directories are configured.
Why are the changes needed?
Some organizations run multiple clusters and have a corresponding log directory for each cluster. If SHS supports multiple log directories, it can serve as a single endpoint for viewing event logs, which helps such organizations.
Does this PR introduce any user-facing change?
Yes, but it will not affect existing users.
How was this patch tested?
Manually confirmed WebUI as screenshots above and added new tests.
Was this patch authored or co-authored using generative AI tooling?
Kiro CLI / Opus 4.6